ABSTRACT



This mechanism relates to a method within the area of 
information mining within a multitude of documents stored 
on computer systems. More particularly, this mechanism
relates to a computerized method of generating a content 
taxonomy of a multitude of electronic documents. The 
technique proposed by the current invention is able to 
improve at the same time the scalability and the coherence 
and selectivity of taxonomy generation. The fundamental 
approach of the current invention comprises a subset selec- 
tion step, wherein a subset of a multitude of documents is 
being selected. In a taxonomy generation step a taxonomy is 
generated for that selected subset of documents, the tax- 
onomy being a tree structured taxonomy hierarchy. More- 
over this method comprises a routing selection step assign- 
ing each unprocessed document to the taxonomy hierarchy 
based on largest similarity. 

31 Claims, 4 Drawing Sheets
TAXONOMY GENERATION FOR
DOCUMENT COLLECTIONS 

FIELD OF THE INVENTION 
The present invention relates to a method within the area 
of information mining within a multitude of documents
stored on computer systems. More particularly, the invention
relates to a computerized method of generating a content 
taxonomy of a multitude of electronic documents. 

BACKGROUND OF THE INVENTION 
Organizations generate and collect large volumes of data
which they use in daily operations. Yet many companies are
unable to capitalize fully on the value of this data because
information implicit in the data is not easy to discern.
Operational systems record transactions as they occur, day
and night, and store the transaction data in files and
databases. Documents are produced and placed in shared files or
in repositories provided by document management systems.
The growth of the Internet, and its increased worldwide
acceptance as a core channel both for communication among
individuals and for business operations, has multiplied the
sources of information and therefore the opportunities for
obtaining competitive advantages. Business Intelligence
Solutions is the term that describes the processes that
together are used to enable improved decision making.
Information mining is the process of data mining and/or text
mining. It uses advanced technology for gleaning valuable
insights from these sources that enable the business user
to make the right business decisions and thus obtain the
competitive advantages required to thrive in today's
competitive environment. Information mining in general
generates previously unknown, comprehensible, and actionable
information from any source, including transactions,
documents, e-mail, web pages, and others, which is then used to
make crucial business decisions.

Data is the raw material. It can be a set of discrete facts
about events, and in that case, it is most usefully described
as structured records of transactions, and it is usually of
numeric or literal type. But documents and Web pages are
also a source of unstructured data, delivered as a stream
of bits which can be decoded as words and sentences of
text in a certain language. Industry analysts estimate that
unstructured data represent 80% of an enterprise's information
compared to 20% from structured data; it comprises data
from different sources, such as text, image, video, and audio;
text is, however, the most predominant variety of
unstructured data.

The IBM Intelligent Miner Family is a set of offerings that
enables the business professional and in general any
knowledge worker to use the computer to generate meaningful
information and useful insights from both structured data
and text. Although the general problems to solve (e.g.
clustering, classification) are similar for the different data
types, the technology used in each case is different, because
it needs to be optimized to the media involved, the user
needs, and to the best use of the computing resources. For
that reason, the IBM Intelligent Miner Family is comprised of two
specialized products: the IBM Intelligent Miner for Data
and the IBM Intelligent Miner for Text.

Information mining has been defined as the process of
generating previously unknown, comprehensible, and
actionable information from any source. This definition
exposes the fundamental differences between information
mining and the traditional approaches to data analysis such
as query and reporting and online analytical processing
(OLAP) for structured data, and from full text search for
textual data. In essence, information mining is distinguished
by the fact that it is aimed at the discovery of information
and knowledge, without a previously formulated hypothesis.
By definition, the information discovered through the
mining process must have been previously unknown, that is, it
is unlikely that the information could have been
hypothesized in advance. For structured data, the interchangeable
terms "data mining" and "knowledge discovery in
databases" describe a multidisciplinary field of research that
includes machine learning, statistics, database technology,
rule based systems, neural networks, and visualization.
"Text mining" technology is also based on different
approaches of the same technologies; moreover it exploits
techniques of computational linguistics.

Both data mining and text mining share key concepts of
knowledge extraction, such as the discovery of which
features are important for clustering, that is, finding groups of
similar objects that differ significantly from other objects.
They also share the concept of classification, which refers to
finding out to which class a certain database record belongs,
in the case of data mining, or a document, in the
case of text mining. The classification schema can be
discovered automatically through clustering techniques (the
machine finds the groups or clusters and assigns to each
cluster a generalized title or cluster label that becomes the
class name). In other cases the taxonomy can be provided by
the user, and the process is called categorization.

Many of the technologies and tools developed in
information mining are dedicated to the task of discovery and
extraction of information or knowledge from text
documents, called feature extraction. The basic pieces of
information in text (such as the language of the text or
company names or dates mentioned) are called features.
Information extraction from unconstrained text is the
extraction of the linguistic items that provide representative or
otherwise relevant information about the document content.
These features are used to assign documents to categories in
a given scheme, group documents by subject, focus on
specific parts of information within documents, or improve
the quality of information retrieval systems. The extracted
features can also serve as meta data about the analyzed
documents. Extracting implicit data from text can be
interesting for many reasons; for instance:

to highlight important information, e.g. to highlight
important terms in documents. This can give a quick
impression whether the document is of any interest,
to find names of competitors, e.g. when doing a case study
in a certain business area one can do a names extraction
on the documents that one has received from different
sources and then sort them by names of competitors,
to find and store key concepts. This could replace a text
retrieval system where huge indexes are not appropriate
but only a few key concepts of the underlying
document collection should be stored in a database,
to use related topics for query refinement, e.g. store the key
concepts found in a database and build an application
for query refinement on top of it. Thus topics that are
related to the users' initial queries can be suggested to
help them refine their queries.
Feature extraction from texts, and the harvesting of crisp
and vague information, require sophisticated knowledge
models, which tend to become domain specific. A recent
research prototype has been disclosed by J. Mothe, T. Dkaki,
B. Dousset, "Mining Information in Order to Extract Hidden
and Strategic Information", Proceedings of Computer- 
[...]

A further technology of major importance in information
mining is dedicated to the task of clustering of documents.
Within a collection of objects a cluster could be defined as
a group of objects whose members are more similar to each
other than to the members of any other group. In information
mining clustering is used to segment a document collection
into subsets, the clusters, with the members of each cluster
being similar with respect to certain interesting features. For
[...]
[...] connected by hyper links;
to ease the process of browsing to find similar or related
information, e.g. to get an overview over documents;
[...] documents in an [...]

Typically, the goal of cluster analysis is to determine a set
of clusters, or a clustering, in which the inter-cluster
similarity is minimized and intra-cluster similarity is maximized.
[...] or best solution to this task. A [...] In contrast to flat
clustering, where the clusters have no genuine relationship,
the clusters in a hierarchical approach are arranged in a
clustering tree where related clusters occur in the same
branch of the tree. Clustering algorithms have a long
tradition. Examples and overviews of clustering algorithms may
be found in M. Iwayama, T. Tokunaga, "Cluster-Based Text
Categorization: A Comparison of Category Search Strategies",
in: Proceedings of SIGIR 1995, pp. 273-280, July 1995, ACM,
or in Y. Maarek, A. J. Wecker, "The Librarian's Assistant:
Automatically organizing on-line books into dynamic
bookshelves", in: Proceedings of RIAO '94, Intelligent
Multimedia, IR Systems and Management, N.Y., 1994.

A further technology of major importance in information
mining is dedicated to the task of categorization of
documents. In general, to categorize objects means to assign
them to predefined categories or classes from a taxonomy.
The categories may be overlapping or distinct, depending on
the domain of interest. For text mining, categorization can
mean to assign categories to documents or to organize
documents with respect to a predefined organization.
Categorization in the context of text mining means to assign
documents to preexisting categories, sometimes called
topics or themes. The categories are chosen to match the
intended use of the collection and have to be trained
beforehand. By assigning documents to categories, text mining can
help to organize them. While categorization cannot replace
the kind of cataloging a librarian does, it provides a much
less expensive alternative.

State of the art technologies for taxonomy generation
suffer several deficiencies, like:
the problem of navigational balance: the taxonomy must
be well-balanced for navigation by an end-user. In
particular, the fan-out at each level of the hierarchy
must be limited, the depth must be limited, and there
must not be empty nodes.
the problem of orientation: nodes in the taxonomy should
reflect "concepts" and give sufficient orientation for a
user traversing the taxonomy.
the problem of coherence and selectivity: the leaf nodes in
the taxonomy should be maximally coherent with all
assigned documents having the same thematic content.
[...] from different nodes should appear [...] distance
in the taxonomy structure.

[...] important problems of the current state of the art
technologies for taxonomy generation are:
the problem of scalability: any document of the collection
must be assigned to [...] larger [...]
[...] an analysis of the given document collection to steer
[...] generation process should be [...]

OBJECTIVE OF THE INVENTION

The invention is based on the objective to improve the
scalability of a taxonomy generation process allowing a
taxonomy generation method to cope with increasing
numbers of documents to be analyzed in a reasonable amount of
time. [...]

SUMMARY OF THE INVENTION

The [...] generating a taxonomy of a multitude of documents
(210) stored [...] being executable by a computer system.
The fundamental approach of the current invention comprises
a subset-selection-step (201), wherein a subset of [...]
documents is being selected. In a taxonomy-generation-step
[...] a taxonomy is generated for said subset of documents,
said taxonomy being a tree-structured taxonomy-hierarchy.
Said subset is divided into a set of clusters with largest
intra-similarity [...] of largest intra-similarity [...]
outer-cluster [...] taxonomy-hierarchy [...] of
taxonomy-hierarchy [...] outer-cluster. [...] comprises a
routing-[...] unprocessed document [...] belonging to said
[...] outer-clusters are computed [...] composing the
outer-cluster with largest similarity.

The technique proposed by the current invention is able to
improve at the same time the scalability and the coherence
and selectivity of taxonomy generation. Scalability is
provided as the taxonomy generation step, being the most time
consuming part of the overall process, is operating on the
selected subset of documents only. This approach alone [...]
overall problem. [...] of claim 1 that the [...] reasonable
selected and reasonable sized [...] respect to complete
multitude of documents. The [...] very efficiently in an
already [...]. By exploiting a hierarchical taxonomy
approach the leaf nodes in the taxonomy are coherent with
[...] number of documents [...] speeds up the
taxonomy generation process. As a result the complete
taxonomy generation process is fully automatic and does not
require any human intervention or adaptation.

Additional advantages are accomplished by the aspect
that said taxonomy-generation-step comprises a
first-feature-extraction-step (202) extracting for each document
of said subset its features and computing its feature statistics
in a feature vector (212) as a representation of said document.
Introducing a distinct feature extraction step increases the
flexibility of the approach, as it becomes possible to
exploit different feature extraction technologies depending
on the intended purpose of the taxonomy, depending on the
document domain and depending on the characteristics of
the various feature extraction technologies. Storing the time
consuming computation of the feature-vectors speeds up
processing as the feature-vectors can be used at later
processing steps.

[...] hierarchical clustering [...] a user traversing the
taxonomy [...] nodes representing more abstract concepts to
lower [...]

Additional advantages are accomplished by the aspect
[...] feature-statistics [...] said feature-vectors of each
document of said cluster. By combining the feature-vectors of
the documents forming [...] which speeds up similarity
comparison of an unprocessed document in the
routing-selection-step significantly.

Additional advantages are accomplished by the aspect
that said routing-step comprises a second-feature-extraction-
step extracting for each of said unprocessed documents its
features and computing its feature statistics in a feature
vector as a representation of said unprocessed document. In
said routing-step said similarities between said unprocessed
document and each of said outer-clusters is computed by
comparing said feature-vector of said unprocessed
document with said category-scheme of said cluster.

Based on above approach similarity calculations of an
unprocessed document with respect to an outer-cluster is
[...]

Additional advantages are accomplished by the aspect
that said first-feature-extraction-step and/or said
second-feature-extraction-step extract features based on lexical
affinities within said documents.

Lexical affinity technology allows the proposed
method to determine (in a domain independent manner)
[...] phrases which have a much higher semantic
meaning compared to the single terms. Thus orientation for
[...] taxonomy is able to reflect [...]

Additional advantages are accomplished by the aspect
that said first-feature-extraction-step and/or said
second-feature-extraction-step extract features based on linguistic
features [...]

Linguistic features technology allows the proposed
method to determine (in a domain independent
manner) names of people, organizations, locations, domain
terms and other significant phrases from [...] variants are
associated with a single [...] to identify co-occurring
words [...]

Additional advantages are accomplished by the aspect
[...] Using this approach the proposed method is able to
[...] features which [...] a high selective property [...]

[...] by the aspect [...] taxonomy-hierarchy [...] merging
most similar clusters [...] taxonomy-hierarchy [...] levels
of abstraction, [...] a taxonomy hierarchy is provided, [...]
a reasonable trade-off.

[...] accomplished by the aspect [...] label [...]
taxonomy-hierarchy [...] advantage for [...] of generated
taxonomy.

[...] advantages [...] accomplished by the aspect [...]
[...] if they happen to occur together [...] by random
selection.

Random selection is a very good [...] ensure that the
subset, being [...], includes documents [...] statistically
most relevant features.

Additional advantages [...] that the range of [...] sized
sub-ranges and [...] performed separately for documents [...]
dates from said sub-ranges.

Using the proposed [...] changes of the features used in the
documents ("evolution of terminology"), which [...] of said
multitude of said documents.

According to the proposed methodology [...]

BRIEF DESCRIPTION OF THE DRAWINGS

[...] of a dendrogram generated [...]
[...] proposed method.
[...] architecture of the [...]
FIG. 4 shows an example of a [...] of generated taxonomy for
the current [...] linguistic feature-based taxonomy.

DESCRIPTION OF THE PREFERRED EMBODIMENT

[...] invention is illustrated on the basis [...] IBM [...]

0.0 Introduction

[...] stored data [...] comparisons to be [...] Documents
[...] rarely have [...] frequently focused on [...] content
[...] rather than [...] extract [...] data about documents
[...] data base where it may [...] data mining techniques.
The meta data [...] enrich [...] content of documents [...]
mining techniques [...] methodologies to the [...] of stored
text by an automated process [...] documents. [...]
structured data.

Most organizations have large [...] intranet documents such
as [...]

Extract key information from [...] queries using powerful
and flexible [...] data (such as text) [...] data type stored
online [...] opportunity: to make more effective use of [...]
is that it is not designed to be used by [...] tabular
information typically stored [...] information was encoded in
the text by [...] Miner for Text can be [...]
0.0.1 Text Analysis Functions

Functions in this area analyze [...] further processing.
They can be [...]

0.0.1.1 Language Identification

[...] clues in the document's [...] languages, and if the
document is written in [...] languages, it determines
the approximate proportion of [...] new set of languages
[...] Applications of the language identification tool
include [...] the process of organizing collections of [...]
by language; restricting search results by language; [...]
routing documents to language translators.

0.0.1.2 Feature Extraction

[...] Examples of the vocabulary found [...] collection of
financial [...]: [...] Restaurant [...], Commission, [...]
Assurance, [...] Equipment, commercial bank, [...] CMOs,
debtor country, Chronar, [...] deposits, [...] First Boston,
[...] note, Credit Lyonnais [...]

[...] names and other [...] terms [...] high quality and in
fact correspond closely to the characteristic vocabulary used
in the domain of the documents being analyzed. In fact what
is found is to a large degree the vocabulary in which
concepts occurring in the collection are expressed. [...]

[...]
Multiword terms (term extraction)
Abbreviations
[...]

[...] vocabulary in the document with [...] which it has
previously [...] of similar documents. When [...] collection
of documents, the feature extractor is able to [...]
evidence from many documents to find [...] vocabulary. For
example, it can often detect [...] different items are really
variants of the [...]

In addition, it can then assign a statistical significance
measure to each vocabulary item [...] the significance of a
word, phrase or name within the [...]

[...] Extraction

[...] occurrences of names in text [...] what type of entity
the name refers to: person, place, organization, or "other"
[...] The module processes either a single document or a
collection of documents [...] produces a dictionary, or
database, of names [...] Name extraction [...] names of
people, organizations, locations [...] words and phrases
from texts. [...] canonical form [...] during further
processing. [...] Each such group of names is [...]
Clinton") to distinguish [...] from other groups referring to
other entities [...] canonical name is the [...] different
[...] challenging [...] single name (e.g. [...]
Administration"), [...] heuristics [...] module [...]
[...] term extraction module uses a [...] identify multi-word
technical terms [...] heuristics, which are based on [...]
of-speech information for English words, [...] simple
pattern matching in order to find expressions having the noun
phrase structure [...]. This [...]

[...] term extraction [...] automatically in text; it is not
limited to finding terms supplied in a separate list. [...]
of terms [...] within a document. Repetition is [...] a term
names a concept that the document [...], helping to ensure
that the expression [...] term. Furthermore, when Text Mining
is analyzing a large collection of documents [...] also helps
to distinguish [...] "alternate minimum tax" [...]

[...] (or something close to it) is also [...] recognized by
name extraction or term extraction [...] the set of already
[...] canonical form. Otherwise [...] the canonical form of a
vocabulary item with the short form as its variant.

[...] new vocabulary [...] "EEPROM" and "electrically [...]
PROM" [...] recognized as short forms for "electrically
erasable programmable read-only memory". Text Mining [...]
abbreviations involving [...]

0.0.1.2.4 Other Extractors

Text Mining [...] to effectively analyze portions of text
[...] recognizers for numbers, dates, [...] amounts [...]
extract information which is potentially useful for certain
applications.

[...] of the following [...] canonical representation [...]
appropriate integer or fixed-point value [...] hundred and
[...] "twenty-seven", "27", [...]

[...] either [...] canonical representation for them. The
representation of [...] (e.g., "next March 27th") [...]
process provides a "reference date" with [...] to which a
date calculator can interpret [...] expressions, with the
[...]

0.0.2 Clustering

[...] Within a collection of objects a cluster could be
defined as a group of objects whose members are more
similar to each other than to the members of any other group.
In text mining clustering is used to segment a document
collection into subsets, the clusters, with the members of
each cluster being similar with respect to certain interesting
features. For clustering [...] predefined taxonomy or
classification schemes [...]

[...]
Identify hidden similarities
[...] browsing to find similar or related information
[...] to reveal structure [...] result of a search, as
described later.

Clusters are discovered in data by finding groups of
objects which are more similar to each other than to the
members of any other group. [...] analysis is [...] of
clusters such that inter-cluster similarity is minimized and
intra-cluster similarity is maximized. Since [...] best
solution to this problem the Intelligent Miner for Text
provides a tool based on a robust [...] configured to match
the intended [...]

0.0.2.1 Hierarchical Clustering

[...] appropriate for different data collections and
interests.



12/08/2003, EAST version: 1.4.1 



US 6,446,061 B1



Hierarchical clustering algorithms work especially well
for textual data. In contrast to flat clustering where the
clusters have no genuine relationship, the clusters in a
hierarchical approach are arranged in a clustering tree where
related clusters occur in the same branch of the tree. The
hierarchical clustering algorithm used by the clustering tool
starts with a set of singleton clusters each containing a single
document. These singleton clusters are the clusters at the
bottom of the clustering tree, called leaves. It then identifies
the two clusters in this set that are most similar and merges
them together into a single cluster. This process is repeated
until only a single cluster, the root, is left. When merging
two clusters their intra-cluster similarity is calculated. The
singleton clusters have an intra-cluster similarity of 100%
since they contain only one document. The binary tree
constructed during the clustering process contains the
complete clustering information including all inter- and
intra-cluster similarities. The inter-cluster similarity between two
arbitrary clusters is the intra-cluster similarity of the first
common cluster. The (binary) tree constructed during this
process, called a dendrogram, contains the complete
clustering information including all inter- and intra-cluster
similarities.
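
The merge loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the clustering tool's implementation: documents are represented as plain feature sets, pairwise similarity is the Jaccard overlap, and the similarity between two clusters is taken as the average pairwise similarity of their members. None of these choices is specified in the text, and the names `jaccard`, `cluster_similarity` and `build_dendrogram` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two feature sets, in [0, 1] (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_similarity(c1, c2, docs):
    """Average pairwise similarity between members of two clusters (assumed linkage)."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(jaccard(docs[i], docs[j]) for i, j in pairs) / len(pairs)


def build_dendrogram(docs):
    """Merge the two most similar clusters until only the root is left.

    Each node is a tuple (members, intra_cluster_similarity, children).
    Singleton clusters start with an intra-cluster similarity of 1.0 (100%).
    Returns the root node and the list of merged nodes in merge order.
    """
    nodes = [(frozenset([i]), 1.0, ()) for i in range(len(docs))]
    merges = []
    while len(nodes) > 1:
        # Pick the most similar pair among the current clusters.
        a, b = max(
            combinations(nodes, 2),
            key=lambda pair: cluster_similarity(pair[0][0], pair[1][0], docs),
        )
        sim = cluster_similarity(a[0], b[0], docs)
        merged = (a[0] | b[0], sim, (a, b))
        nodes = [n for n in nodes if n is not a and n is not b] + [merged]
        merges.append(merged)
    return nodes[0], merges


docs = [
    {"online", "library", "book"},
    {"online", "library", "catalog"},
    {"computer", "hardware"},
    {"computer", "hardware", "disk"},
]
root, merges = build_dendrogram(docs)
for members, sim, _ in merges:
    print(sorted(members), round(sim, 3))
```

In this sketch the similarity stored on a merged node is the level at which its two children were joined, so the inter-cluster similarity of any two clusters can be read off their first common ancestor, as the text describes.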
The binary tree might get very deep, containing a lot of
clusters with only a few documents. Because these binary
trees are inconvenient to visualize, a configurable slicing
technique is applied in a further processing step. Clusters
within the same branch that have a comparable intra-cluster
similarity are merged into a single cluster. This reduces the
depth of the tree and eases browsing or further processing.
The left diagram in FIG. 1 shows a graphical representation
of a dendrogram. The vertical axis measures the similarity.
The horizontal lines in the tree are at the similarity
level that formed the corresponding cluster, i.e., the intra-
resulting clustering would not be very useful in general. One
alternative to using single words would be to perform a
semantic analysis of the documents, in order to find the
concepts mentioned in the text and use them for clustering.
This kind of analysis is very expensive and furthermore
depends on a lot of domain-dependent knowledge that has to
be constructed manually or obtained from other sources.
Instead of taking this approach, the clustering tool uses
lexical affinities instead of single words. A lexical affinity is
a correlated group of words which appear frequently within
a short distance of one another. Lexical affinities include
phrases like "online library" or "computer hardware" as well
as other less readable word groupings. They are generated
dynamically, thus they are specific for each collection. A set
of semantically rich terms can be obtained without a need to
hand-code a specialized lexicon or a thesaurus. The
clustering tool uses a list of the lexical affinities in each document
as the basis for its similarity calculation. A cluster can be
labeled with the lexical affinities it contains, which allows a
user to quickly assess the characteristics of the cluster.
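
As a rough illustration of the idea, the following Python sketch treats a lexical affinity as an unordered pair of words that co-occur within a small window at least a minimum number of times, and compares two documents by the overlap of their affinity sets. The window size, the frequency threshold, the tiny stop-word list and the function names are all assumptions made for the example; the text does not state how the clustering tool actually extracts or weights affinities.

```python
import re
from collections import Counter

# Illustrative stop-word list only; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "in", "to", "is", "for"}


def lexical_affinities(text, window=5, min_count=2):
    """Return word pairs that co-occur within `window` words at least `min_count` times."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    counts = Counter()
    for i, w in enumerate(words):
        for v in words[i + 1 : i + 1 + window]:
            if v != w:
                counts[tuple(sorted((w, v)))] += 1  # unordered pair
    return {pair for pair, n in counts.items() if n >= min_count}


def affinity_similarity(a, b):
    """Overlap (Jaccard) of two affinity sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


text = ("The online library opened. An online library is useful. "
        "Computer hardware is cheap.")
print(sorted(lexical_affinities(text)))
```

Because the pairs are derived from the documents themselves, the resulting vocabulary is specific to the collection, which is the property the text emphasizes.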

Of course instead of a lexical affinity based feature 
extraction methodology for similarity calculation any other 
feature extraction methodology (for instance linguistic fea- 
ture extraction) may be used. 
0.0.3 Categorization 
In general, to categorize objects means to assign them to
predefined categories or classes from a taxonomy. The
categories may be overlapping or subsuming. For text
mining, categorization can mean to assign categories to
documents or to organize documents with respect to a predefined
organization. These could for example be the folders on a
desktop, which are usually organized by topics or themes.
Categorization in the context of text mining means to assign
documents to preexisting categories, sometimes called
topics or themes. The categories are

cluster similarity for this cluster. The inter-cluster similarity
between two arbitrary clusters is the similarity level of the
first common cluster. Pure dendrograms are often
inconvenient to visualize and interpret as they are binary. Therefore
the configurable slicing technique has been applied that
identifies the clusters within the tree which have a
comparable degree of intra-cluster similarity. With this information
new clusters are formed by merging previous clusters.
Slicing can be adjusted to individual needs. The clustering
tool allows the user to set the upper and lower thresholds of
intra-cluster similarity, and to define the number of slicing
steps at which slicing is performed. For example, specifying
an upper threshold of 90%, a lower threshold of 10%, and
the number of slice steps as 8, the clustering tool will check
for potential clusters at similarity levels of 90%, 80%, 70%,
60%, 40%, 30%, 20%, and finally 10%. The right diagram
in FIG. 1 shows the result of slicing for the dendrogram on
the left with the above parameter setting. In this case the two
clusters {d8,d9} and {d10} have been merged to a new
cluster {d8,d9,d10}.

A convenient way to combine a coarse overview and 
detailed information of the clustering structure is to repeat 
the slicing at different levels of intra-cluster similarity 
thresholds. The combined output reflects the most important 
decisions of the hierarchical clustering algorithm at some 
snapshots, taken at different levels. 
0.0.2.2 Computation of Document Similarity Using Lexical
Affinities

It is clear that the notion of similarity between documents
and clusters is crucial. A very simple similarity measure
would be the degree of overlap for single words in the
documents. Due to the ambiguity of single words, this
measure of similarity is rather imprecise, and the

intended use of the collection and have to be trained before-
hand. By assigning documents to categories, text mining can
help to organize them. While categorization cannot replace
the kind of cataloguing a librarian does, it provides a much
less expensive alternative. In addition it can be very useful
in many other applications:

to organize intranet documents. For example, documents
on an intranet might be divided into categories like
Personal policy, Lotus Notes information, or Computer
information. It has been estimated that it costs at least
$25 to have a librarian catalogue an item. It is clearly
impractical to catalogue the millions of documents on
an intranet in this way. By using automatic
categorization, any document could be assigned to an
organization scheme and a link to the respective cat-
egory could be generated automatically.
to assign documents to folders. Categorization can help to 
file documents in a smarter way. For example, it could 
help a person who has to assign e-mail to a set of 
folders by suggesting which folders should be consid- 
ered. 

The categorization tool assigns documents to predefined
categories. For this purpose the categorization tool first has
to be trained with a training set consisting of a collection of
sample documents for each category. These collections are
used to create a category scheme. The training uses the
feature extraction tool in order to store only relevant infor-
mation in the dictionary. The category scheme is a dictionary
which encodes in a condensed form significant vocabulary
statistics for each category. These statistics are used by the
categorization tool to determine the category or categories
whose sample documents are closest to the documents at
hand. The purpose of the categorization algorithm is to
return a ranked list of categories for a given piece of text,
called query document. The rank value is a sum over the
number of occurrences of all the different vocabulary items
(i.e. canonical forms or their variants) in the query docu-
ment. Each number of occurrences is weighted by a term
that takes into account the frequency of the vocabu-
lary item in the category itself in proportion to its relative
frequency in the whole training set. Thus a word with a high
frequency in a single category and a comparatively low
frequency in the whole training set will have a higher weight
than a word that occurs very frequently in the training set or
a word that occurs less frequently in the category. Thus for
each category a rank value will be calculated with respect to
the vocabulary of the query document. The output of the
categorization tool is a text document that lists each input
document together with its highest ranked categories. The
number of the returned categories can be specified by the
user. The rank value is returned as well and can be used for
further processing.

The categorization approach may be based on any feature
extraction algorithm.

0.1 General Requirements and Structure of the Taxonomy
Generator

The discipline of text mining merges methods of infor-
mation retrieval, computational linguistics and data mining
to achieve means for an automatic analysis of large and
unstructured corpora of digital documents. Although this
seems to be pretty close to data mining-like engineering for
large sets of raw transactional records, there are a few but
important differences.

First, the question of similarity of two different texts can-
not be answered independent from the intended solution. A
plain word-by-word comparison, statistical phrase detection
or sophisticated names extraction methods may judge the
same two documents to be similar or not; the acceptability
of this decision depends on whether one must provide for
instance a coarse overview of a company's patent text
collection, as opposed for instance to the task of detecting
threads of important persons in a news wire text stream. In
the first case, the semantic notion of similarity could be
reduced to standard text processing techniques (stemming,
dictionary building and lookup), whereas the second case
could require heuristics and parsing techniques as they are
developed in the area of Natural Language Processing. Fact
extraction from texts, and the harvesting of crisp and vague
information, as it is done in information discovering
projects, however, require sophisticated knowledge models,
which tend to become domain specific.

As for the second difference, text mining must deal with
a much higher dimensionality of features per entity. For a
real-world text database the number of potential features
ranges up to tens or hundreds of thousands. This has a major
impact, not only on the design of the mining algorithms
used, but also puts a heavy burden on the subsequent
visualization modules.

The current invention especially solves the problem to
generate a taxonomy for an independently gained collection
of digital documents whereby the number of documents is
extremely large. Moreover the suggested method has been
adapted to scale very well and to be able to even treat a
number of documents which is significantly larger. In case
of the example to be used to demonstrate the current
invention, it consists of more than 70.000 documents cov-
ering one year of a news wire stream with medium sized
documents (between hundreds and a thousand lines of text).
The goal of the current invention was to provide a design to
automatically generate a taxonomy, which must fulfill the
same purpose as the keywords and which can be used to
detect new and unforeseen intra-document relationships.
Further focus has been given to two major problems of text
mining in a real-world environment: what are sufficient
criteria to consider two documents similar and how to
reduce the dimensionality of a full-text based domain.

By analyzing and judging the taxonomies generated by
the suggested method of taxonomy generation it may be
shown that the current invention solves the following prob-
lems of the state of the art:

Navigational Balance
The taxonomy must be well-balanced for navigation by an
end-user. In particular, the fan-out at each level of the
hierarchy must be limited (goal: less than 20), the depth
must be limited (maximum 6 levels), and there must not be
empty nodes. Given these specifications, a standard tree
browser can be used to browse the taxonomy, since it will fit
on a typical screen.

Orientation
Nodes in the taxonomy should reflect "concepts" and be
adequately labeled so that the labels of the nodes give
sufficient orientation for a user traversing the taxonomy.

Coherence and Selectivity
Leaf nodes in the taxonomy should be maximally
coherent with all assigned documents having the same
thematic content. Related documents from different nodes
should appear within short distance in the taxonomy struc-
ture. For each non-leaf node, the selectivity of the computed
taxonomy should be reasonably high, i.e., one must be able
to anticipate what will be the difference(s) between branches
originating at this node.

Domain-independence of All Applied Technologies
The proposed method may not use any hand-coded
knowledge derived from an analysis of the given document
collection to steer the taxonomy generation process.

Linguistic Methods to Be Integrated to Increase the Con-
ceptual Quality of the Taxonomy
Conceptual quality is being defined as:
coherence and selectivity of document clusters and
significance of extracted terms and phrases
The proposed method has to cope with the ambiguity
problem of natural language. Two feature extracting tech-
nologies can be exploited by the current teaching: namely
the Lexical Affinities (LA) and the Feature Extraction tech-
nology. Both techniques are compared against manually
edited texts, i.e., the keywords assigned to the document
collection by the editors to judge their appropriateness.

Scalability
Any document of the collection must be assigned to some
leaf node in the taxonomy and the techniques must be
applicable to significantly larger databases than the sample
provided. Therefore a separate routing task has been intro-
duced to the proposed technique to guarantee scalability for
arbitrary collection sizes and processing of forthcoming
documents. The routing task is to train a given taxonomy and
sort all documents into appropriate clusters, both during
the build phase and for later use (adding new documents to
the database).

Extensibility
The automatically generated taxonomies might not match
"common sense" expectations of what is a familiar, well-
known index for a large corpus with a high diversity of
topics. Hence, the current invention shows well-defined
interfaces to add domain-oriented knowledge (e.g. a
thesaurus) to improve the navigability of the
taxonomy tree.






US 6,446,061 B1



^n^rr; '° ^^^^ f^^r'oratiaaand Evaluation '\ 

mus^be^ta.eaagatosj5£cont^^ / 

maiion; tor instancx a user conlH 
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mation; for instance riST^^aai^^^rS^p" / 
down browsing from high-level concepts to individual docu-
ments (news article).

niJ^i^^tK^ T'^'*' "^"^ P'o«ss architecture of the 
proposed method of generating a content taxonomy. 
1. The method of generating a content taxonomy according
to the current invention starts by selecting (201) a subset
of the documents. In the example of the current invention
the generation is based on a subset of 10% of the samples
taken from the large document database (73.000 documents).
A taxonomy generated according to the suggested method is
already stable when using a sample of up to 10% of the
documents. The subset has been determined by random
selection. The subset selection process can be improved if
the random selection process is performed based on
documents with similar size or similar date. Members of the
sample were then selected from across the complete period
of time which was spanned by the news collection.
2. The sample (211) is then piped into the corresponding
feature extraction (202) tool, which produces a concise
representation of each document of said subset, in form of
the feature vectors of the subset documents (212).

chicalclusterms-modffi=praj: — — * 

4. The label generation (204) as the next step works on the
same subset in order to enforce consistency between the
taxonomy structure and the assigned labels, generating a
labeled taxonomy (214).

5. When clustering and labeling is finished, the data and the
processing fork into the categorization training (205).
The training is performed on the same subset in order to
enforce consistency between the taxonomy structure and
the assigned labels. The categorization step
Benerate.s-cai<>tTnrT, i^K»~„„ /-m-i : .. f '"^Hvu.aicpi 



combined with word stemming to increase the recall of
identified topics. Although one loses discriminative power in
certain circumstances, stemmed lexical affinities have
shown to produce much more dense relationship webs.

LAs are completely domain-independent. The only adap-
tation to a corpus required is to modify the stop word list
used by the LA extractor. Meaningless and high-frequency
words are excluded by placing them on the stop
word list. This little adaptation step is recommended for any
usage of LAs and clustering. Another advantage of LAs is
that this technology can be ported easily to other languages.
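
The lexical-affinity extraction described above (co-occurring words within a five word window, with a stop word list as the only corpus-specific adaptation) can be illustrated with a minimal sketch. The function name, the tiny stop word list, and the choice to apply the window after stop word removal are assumptions made for illustration; stemming is omitted, and the description does not specify the extractor at this level of detail.

```python
import re
from collections import Counter

# Illustrative stop word list; a real one would be adapted to the corpus.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def lexical_affinities(text, window=5, stop_words=STOP_WORDS):
    """Count unordered word pairs that co-occur within `window` words."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in stop_words]
    counts = Counter()
    for i, left in enumerate(tokens):
        # The next window-1 tokens share a window of size `window` with `left`.
        for right in tokens[i + 1:i + window]:
            if left != right:
                counts[tuple(sorted((left, right)))] += 1
    return counts

text = ("The service provider sells computer software. "
        "The service provider supports computer software.")
affinities = lexical_affinities(text)
top_pairs = affinities.most_common(3)
```

Pairs are stored in sorted order so that "service provider" and "provider service" count as the same affinity; whether the original extractor treats pairs as unordered is not stated and is assumed here.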
The second technique applied was the computation of
linguistic features (LFs) in the text sources. LF extraction
is able to identify and extract names of people, organizations,
locations, domain terms (multi-word terms), and other sig-
nificant words and phrases from texts. The Feature Extrac-
tion tool also performs analyses for the terms, e.g.
cd^iah«iJn!fp.fiQSXofl^^ 

coltedisn Amam property of this tool is that it ass^iaies 
fonn and then treats them as a single tenn during further 

l^^Prf^A"'^^^' "Bill Clinton" 

and "The President of the United States"






-r— V-—/ — jjijjcu uiio me comespondinc and "Th. Dr.«-.4.-. « . ' "-immn - 

feature extraction (202) tool which mtxhio^ , "eadenl of the United States". 
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generates category-schemes (215) for the (SSS'L ZnlST '•^^'^^sSlhi^gational balanc 
lZ.T!^lt '° '^."^ .'-f. °°<?- of Uxonomyl tfd SLTerw^l'Tr"!' 1 IP^Sf-measui^ 



ters assigned to the leaf n^es'of 'tbrt^^^nomyl'^d \ 
^[^^^S^nifi^beingjtefea^^ 

6. Afterwards a final routing process (206) takes each
unprocessed document of the document database (210)
and assigns it to its most appropriate cluster. The clusters
are presented to the user together with the precomputed
d^^ba^'(2lT' taxonomy of the whole dociLent 
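
The subset selection of step 1, in the variant where the random selection is performed separately for equally sized date sub-ranges, might be sketched as follows. The function name, the number of sub-ranges, the fixed seed, and the synthetic collection are all hypothetical; only the 10% fraction and the idea of sampling per date sub-range come from the description.

```python
import random
from datetime import date, timedelta

def stratified_sample(docs, fraction=0.10, sub_ranges=12, seed=0):
    """Sample doc_ids at random, separately per equally sized date sub-range.

    docs is a list of (doc_id, date) pairs.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    ordinals = [d.toordinal() for _, d in docs]
    first, last = min(ordinals), max(ordinals)
    width = (last - first + 1) / sub_ranges
    groups = [[] for _ in range(sub_ranges)]
    for doc_id, d in docs:
        index = min(int((d.toordinal() - first) / width), sub_ranges - 1)
        groups[index].append(doc_id)
    sample = []
    for group in groups:
        quota = round(len(group) * fraction)  # per-sub-range share
        sample.extend(rng.sample(group, quota))
    return sample

# Hypothetical collection: 10 documents per day over 360 days.
docs = [(i, date(2000, 1, 1) + timedelta(days=i % 360)) for i in range(3600)]
subset = stratified_sample(docs)
```

Sampling per sub-range keeps every part of the time span represented in the subset, which plain random selection only achieves on average.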
0.2 Pre-Processing Digital Documents: Feature Extraction 
Two preprocessing functions are suggested to perform a
mapping from raw text into countable and comparable
discrete variable values: feature extraction by lexical affinity
(LA) technology and by linguistic feature (LF) technology.

Actually both technology approaches are providing feature
extraction capability based on certain "linguistic" properties
of the documents. Nevertheless the terminology of "lexical
affinity (LA)" and "linguistic feature (LF)" as used in the
current description has to be understood in the more narrow
sense as defined below.

ofJfh^^cl^^'*^^^ ^'^'^ occurrences 
of phrases like "service provider" or "computer software",
which have shown to have a much higher semantic
meaning compared to the single terms "service",
"provider", "computer" and "software". In the current
implementation a five word window to identify co-occurring
words was of specific advantage. This fast and powerful
method has been


■ t • • ri, — -'»"-»-"uu icuuj. 10 oe roo productive for 

toti^ ri*" ««nPutation of the term-vector represe^ 
totion of dQSBBgnts and clusters. To overcome this ^he 
selcctaon of terms for the clustering step can be bS on 

gives the LF-based approach the advantage that more
parameters can be changed in the process to tune the
taxonomy towards a well balanced view of the data without

and mulu-worf terms as inEUtfeatures for clustering in a 
PcrsoBs.oompamg etc.") leads to ve^ 



I J """'B " iyusemeasure (i.e.. all 

(smgle) words that are not part of IhTlt^liSSirto) 
^M^i^^£l!!2tercober^^ 

woric , system , "information", . ) 
As a solution to this problem the invention suggests to use
names, terms and general words, but to apply filtering to
remove high-frequency terms and very low-frequency
terms. Filtering high-frequency terms basically is close to
automatically extending a stop word list. Filtering low-

uxonomy. '^"^ """^ °* ^ 
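
The filtering of high-frequency and very low-frequency terms suggested above can be illustrated with a short sketch based on document frequency. The function name and both thresholds (`min_df`, `max_df_ratio`) are assumptions; the description gives no concrete cut-off values.

```python
from collections import Counter

def filter_features(doc_features, min_df=2, max_df_ratio=0.5):
    """Drop features that occur in too few or too many documents.

    doc_features holds one list of features per document.
    """
    n_docs = len(doc_features)
    doc_freq = Counter()
    for features in doc_features:
        doc_freq.update(set(features))  # count each feature once per document
    keep = {f for f, df in doc_freq.items()
            if df >= min_df and df / n_docs <= max_df_ratio}
    return [[f for f in features if f in keep] for features in doc_features]

documents = [
    ["system", "ibm", "merger"],
    ["system", "ibm", "lotus"],
    ["system", "oracle", "merger"],
    ["system", "typo"],
]
filtered = filter_features(documents)
```

Here "system" is removed because it appears in every document, much like an automatically added stop word, while features seen in only one document are removed as too rare to relate documents to each other.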

0.3 Hierarchical Clustering
The hierarchical clustering algorithm (HCA) exploited by
the current invention is a bottom-up algorithm. It starts with
the individual documents and builds the bottom clusters
first. It then works its way upward in the hierarchy forming
higher-level clusters by grouping together related clusters.
Theadv^jage^HCA^ppna^^ ^• 
gooTooherenM atjelo^^ 

Usually , the first level ot the generated hTe archy I 

Twin -^^'^ '""^P'^**" °f ^'^^ 'he suppC 
keywords as input) so this remains an issue for application
programming when building a real solution based on this
technology. Scalability is achieved in this approach by
building the taxonomy only on a subset of the documents.

Like most agglomerative cluster algorithms, the HCA
internally builds up a complete cluster hierarchy (a
dendrogram). This hierarchy relates each pair of documents
and annotates this relationship with the corresponding
degree of similarity. Slicing a dendrogram into layers with
certain ranges of similarity allows one to speak of high-level
concepts (with a lower intra-cluster similarity) and more
concrete topics (closer to the leaves of the dendrogram, with
high similarities). Since the routing task has been assigned
to a subsequent processing step, it was possible to identify
the optimal range specification to ease the process of brows-
ing the final taxonomy, which essentially is a dendrogram
sliced into (a maximum of) six levels. Of course the pro-
posed approach allows one to adjust the generated hierarchy
to any desired (predefined) number of levels.
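
A bottom-up clustering that records a dendrogram, followed by slicing at a similarity threshold, can be sketched as below. This is only an illustration under stated assumptions: cosine similarity on sparse feature vectors and average linkage are choices made here, not details confirmed by the description, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def build_dendrogram(vectors):
    """Merge the two most similar clusters until one remains.

    Returns the merges bottom-up as (similarity, members_a, members_b).
    """
    n = len(vectors)
    pair = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(pair[i][j] for i in clusters[a] for j in clusters[b])
                sim = total / (len(clusters[a]) * len(clusters[b]))  # average linkage
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        sim, a, b = best
        merges.append((sim, clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

def slice_dendrogram(n_docs, merges, threshold):
    """Return the clusters formed by all merges with similarity >= threshold."""
    parent = list(range(n_docs))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for sim, left, right in merges:
        if sim >= threshold:
            parent[find(left[0])] = find(right[0])
    groups = {}
    for i in range(n_docs):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

vectors = [
    {"computer": 2, "software": 1},
    {"computer": 1, "software": 2},
    {"merger": 2, "bank": 1},
    {"merger": 1, "bank": 2},
]
merges = build_dendrogram(vectors)
topics = slice_dendrogram(len(vectors), merges, 0.5)
```

Calling `slice_dendrogram` with several thresholds yields one layer of the hierarchy per threshold, with lower thresholds giving the broader, higher-level clusters.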

A fundamental advantage of the current invention is that
working on a subset does not impact the quality of the
generated taxonomy hierarchy. Cluster runs with different
subset sizes and otherwise unchanged parameters showed
that the taxonomy becomes 'stable' at about 10% of the
documents (for the current example), i.e. the resulting
structure did not change significantly when increasing the
sample size. Thus to achieve a sufficient sampling of a
heterogeneous text collection, one must create a taxonomy
based upon up to 10% of the topics in the collection or work
with at least 5000 documents to operate the HCA algorithm
within a statistically reasonable base of documents. The
point of saturation changes slightly depending on the input
preprocessing
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egory scheme can be trained as follows. The categories were 
taken to be the leaves of the taxonomy with the documents 
assigned to them as training material. The table gives details 
about resource requirements and performance of the two 
category schemes: 
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Taxonomy 


                        Full linguistic      Statistical
                        preprocessing        phrase analysis
# Categories            686                  686
# Training documents    3.729                6534
Training set size       18.8 MB              31.1 MB
Scheme file size        35 MB                41 MB
Training time           20 min               40 min



With both the category schemes the routing task was
performed on all of the 73031 documents. The complete
routing into one taxonomy took around 100 minutes, i.e., the
routing speed is above 200 MB/h. Each document was
assigned to the taxonomy category that received the highest
rank value from the topic categorizer. Of those documents
that constituted the training set the correct classification
could be verified in 99.95% (LF), and 99.6% (LA). Clearly,
this nearly perfect reclassification for the LF taxonomy is to
be expected on the grounds that the clusters created by the
clustering engine define themselves through a similarity
measure akin to the one used in the classification step. But
also the low reclassification error for LA shows that the
LA-based category scheme can be trained very well by the
categorizer.

technique which has been applied. For input data having a
lower dimensionality of features (supplied keywords,
LA-based preprocessing), it turns out that no more signifi-
cant change is achieved above ~4000 documents (if sampled
properly). For higher dimensional input the size of the
sample has to be increased by some extra data (i.e.
documents).

The quality achievable for the routing task with this size
of training data is very high (see the next section on the
accuracy of the router). Clustering of a 5000 document
sample takes about 1.5 hours on an RS/6000 43P. This
relatively short elapsed time for taxonomy generation is of
great advantage because it allows for short turnaround
cycles while experimenting with various process setups.
0.4 Routing New Documents 

According to the general structure of the taxonomy gen-
eration algorithm of the current invention its purpose is not



0.5 Results 

By applying lexical-affinity extraction as feature extrac-
tion technology, a taxonomy has been achieved for an
example of more than 70,000 news documents which was
suited for navigation, using up to 6 levels if required to
visualize different aspects of concepts contained in the
clusters. It shows an average branching factor of 3 with no
more than 12 branches in the inner parts, a reasonably sized
second-level structure (252 nodes), and a size of the leaf
nodes that is well-balanced. The root of the taxonomy
reflects the different topics as they can be found in such a
large collection.
FIG. 3 shows an example of a node of the generated
taxonomy for the current example taken from the LA-based 



„ 1 . ^ - , ^ t'-at^^ la^uuuujy lor inc current example taicen irom tlie LA-based 

bfaKr^hf.'^T document collection (the subset), 45 taxonomy. A node represents f cluster, prigl^^^tft 

butalso to be able to classify new locomuig documents. hand side, and the outgoing links, pdn^^&^i hand 

'^deJJeVcrJn '"TT "^^^ side. It either links tom le4lK. to kvel x7l o^^^ 

rTT ' '"^"'} ^ u *=T'°' ^ *^ ^ " °°de) il shows . tcrmind duster which Ifatslte 

nodes of the taxonomy denved by the hierarchical cluster- titles of dl documents belonging to this ctosSee^S^ 

mg_lnatrammgpha5e^^^^^ «°de is taken from the seco^layer of ttelS«X!ua 

^ -^!B^lJo2H«^nU .fjau.each^ate^^^ Mnks from level 2 to level 3 (where level 1 is the ^\ of the 

'^J'^^^^^^^^iSSm^c^^^^ This tree). In cases like the one shown here, where toe sub 

trammg uses the same feature extraction technology and clusters have similar top-frequent terms (NT, window) the 

ools as mentioned above, and toe vocabulary iteinsactua^^ output is augmented by dislbiguating terti^^aken fern 

thouihtnfr, f ff r- '''^'^ ^ « «he cluster rep,esentation.TT.us,TLr4l always naviSte 

thoughtof asasetof featurevectois,onepercategory.inthc within a semantic context, showing the alternatives for the 

muhidmiensional vector space where each feature cone- actual browsing decision 

sponds to a dimension. Each vector encodes frequency The taxonomy shows very good coherence i e related 

3raiuir±'""''r f • • . documentsare^oupedclo^ly^ogetoer.Ac<^rdtokote 

aJed ^T^J^T^^Z^'^''^^"'^'^'^'^^'''' ^ l'^^^''^ embodiment of the current invention toe 5 most 



called query document, the processing step computes its
feature vector using the feature extraction tool in the most
efficient mode. It then returns a ranked list of categories,
sorted according to the angle between the query document
vector and the respective category vector.
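
Ranking categories by the angle between the query document vector and each category vector can be sketched as follows. A smaller angle corresponds to a larger cosine, so the sketch sorts by cosine in descending order. The function names, the sparse-dictionary representation, and the example vectors are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def rank_categories(query_vector, category_vectors):
    """Return (category, cosine) pairs, smallest angle first."""
    scored = [(name, cosine(query_vector, vec))
              for name, vec in category_vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical category scheme: one frequency vector per leaf category.
category_vectors = {
    "computing": {"computer": 3, "software": 2},
    "finance": {"bank": 3, "merger": 2},
}
query = {"software": 1, "computer": 1, "update": 1}
ranking = rank_categories(query, category_vectors)
best_category = ranking[0][0]
```

Routing a document then amounts to assigning it to the first category in the returned list.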

For both versions of the taxonomy, based upon full 
linguistic preprocessing and upon lexical affinities, a cat- 



frequent distinguishing LAs within each cluster have been
selected as the cluster label. Thus similar concepts like
"computer hardware" and "computer software" are dis-
played as "computer, hardware" and "computer, software",
if they happen to occur together in a cluster structure.
Overall, the labels represent the cluster content well and
allow for easy orientation towards interesting sub clusters.
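
Selecting the most frequent distinguishing features as cluster labels might look like the sketch below. How "distinguishing" is decided is not spelled out in the description; here a feature is assumed to be non-distinguishing when it occurs in every sibling cluster. The function name and the example clusters are hypothetical.

```python
from collections import Counter

def label_clusters(clusters, n=5):
    """Pick up to n most frequent features per cluster that are not
    shared by all sibling clusters.

    clusters maps a cluster name to a list of per-document feature lists.
    """
    counts = {name: Counter(f for doc in docs for f in doc)
              for name, docs in clusters.items()}
    shared = set()
    if len(counts) > 1:
        shared = set.intersection(*(set(c) for c in counts.values()))
    return {name: [f for f, _ in c.most_common() if f not in shared][:n]
            for name, c in counts.items()}

# Features are lexical affinities, written as word pairs.
siblings = {
    "A": [[("computer", "hardware"), ("chip", "computer")],
          [("computer", "hardware")]],
    "B": [[("computer", "software")],
          [("computer", "software"), ("chip", "computer")]],
}
labels = label_clusters(siblings)
```

With this rule, an affinity that appears under both siblings is dropped from the labels, leaving "computer, hardware" and "computer, software" to tell the two clusters apart.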
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The snapshot of FIG. 4 is an example taken from the 
generated taxonomy which was generated based upon the 
linguistic-feature-based technology extracting names, terms 
and general words. Overall, the taxonomy is broader than 
the LA-based taxonomy at the first and second level (73 and
283 nodes, respectively). Average branching factor is 3 with 
a maximum of 7 in the inner parts. Coherence is very good 
throughout the taxonomy. The 5 most frequent LFs within 
each cluster have been selected as the cluster label. Since the 
LFs are a mixture of names, phrases, and single words, 
orientation is slightly worse than for LAs in some cases. 
Readability, though, is pretty good. 

In general the generated taxonomies are well-balanced 
and coherent down from level 2 of the hierarchy. The leaf 
nodes of the taxonomies are very coherent and related 
documents are within short distance in the taxonomy. Select-
ing the n most frequent terms in a cluster for labeling gives
good overall orientation in the structure.

We claim: 

1. A computer-executable method of generating a content 
taxonomy of a multitude of documents (210) stored on a 
computer system, said method comprising: 
a subset-selection-step (201), for selecting a subset of said 

multitude of documents; 
a taxonomy-generation-step (202 to 205), for generating 

a taxonomy for said subset, 
wherein said taxonomy is a tree-structured taxonomy- 
hierarchy, and 

wherein said subset is divided into a set of clusters with 
largest intra-similarity, and 

wherein each of said clusters of largest intra-similarity is 
assigned to a leaf-node of said taxonomy-hierarchy as 
outer-clusters, and 

wherein inner-nodes of said taxonomy-hierarchy order
said subset, starting with said outer-clusters, into inner-
clusters with increasing cluster size and decreasing
similarity, and

wherein said taxonomy-generation-step further comprises
a first-feature-extraction-step (202) for extracting for
each document of said subset its features, and for
computing its feature statistics in a feature-vector (212)
as a representation of said document; and

a routing-selection-step (206), for computing, for each
unprocessed document of said multitude of documents
not belonging to said subset, similarities with said
outer-clusters, and for assigning said document to the
leaf-node of said taxonomy-hierarchy being the outer-
cluster with largest similarity.

2. The method of claim 1, wherein said taxonomy- 
generation-step further comprises a clustering-step (203) 
using a hierarchical clustering algorithm for generating said 
taxonomy-hierarchy, and using said feature-vectors for
determining similarity. 

3. The method of claim 1, wherein said features are 
extracted based on lexical affinities within said documents. 

4. The method of claim 3, wherein said lexical affinities 
are extracted with a window of M words to identify 
co-occurring words. 

5. The method of claim 4, wherein M is a natural number
with 1<M≤5.

6. The method of claim 1, wherein said features are 
extracted based on linguistic features within said documents. 

7. The method of claim 1, wherein extracted features are 
selectively ignored based on statistical frequency extremes. 

8. The method of claim 1, wherein the depth of the
taxonomy-hierarchy is limited to L levels by using a slicing
technique to merge most similar clusters into one cluster
until said taxonomy-hierarchy includes L levels.

9. The method of claim 8, wherein L is a natural number
from the range 1≤L≤12.

10. The method of claim 1, wherein said taxonomy-
generation-step further comprises a labeling-step (204)
labeling each node in the taxonomy-hierarchy.

11. The method of claim 10, wherein the N most frequent
distinguishing features of a cluster of a node in the
taxonomy-hierarchy are used as labels.

12. The method of claim 11, wherein N is a natural
number with 1≤N≤10.

13. The method of claim 1, wherein said subset of said 
multitude of documents is determined by random selection. 

14. The method of claim 13, wherein the range of the
document dates is divided into equally sized sub-ranges and
said random selection is performed separately for documents
with document dates from said sub-ranges.

15. The method of claim 1, wherein said subset comprises
up to 10% of said multitude of said documents.

16. A computer program product comprising a computer 
usable medium having computer readable program code 
means embodied in said medium for generating a content 
taxonomy of a multitude of documents stored on a computer
system, said computer readable program code means com-
prising:

a subset selector for selecting a subset of said multitude of 
documents; 

a taxonomy generator for generating a taxonomy for said 
subset, wherein said taxonomy generator further com- 
prises a first-feature-extractor for extracting for each 
document of said subset its features, and for computing 
its feature statistics in a feature-vector as a represen- 
tation of said document; and 
a routing selector for computing, for each unprocessed 
document of said multitude of documents not belong- 
ing to said subset, similarities with said outer-clusters 
and for assigning said document to the leaf-node of said 
taxonomy-hierarchy being the outer-cluster with larg-
est similarity,

wherein said taxonomy is a tree-structured taxonomy- 
hierarchy, and 
wherein said subset is divided into a set of clusters with 

largest intra-similarity, and 
wherein each of said clusters of largest intra-similarity 
is assigned to a leaf-node of said taxonomy- 
hierarchy as outer-clusters, and 
wherein inner-nodes of said taxonomy-hierarchy order 
said subset, starting with said outer-clusters, into 
inner-clusters with increasing cluster size and 
decreasing similarity. 
17. A system for generating a content taxonomy of a 
multitude of documents stored on a computer system, said
system comprising: 
means for selecting a subset of said multitude of docu- 
ments; 

means for generating a taxonomy for said subset, 
wherein said taxonomy is a tree-structured taxonomy-
hierarchy, and

wherein said subset is divided into a set of clusters with
largest intra-similarity, and
wherein each of said clusters of largest intra-similarity
is assigned to a leaf-node of said taxonomy-
hierarchy as outer-clusters, and

wherein inner-nodes of said taxonomy-hierarchy order 
said subset, starting with said outer-clusters, into
inner-clusters with increasing cluster size and
decreasing similarity, and
wherein said means for generating further comprises a
first-feature-extractor means for extracting for each
document of said subset its features, and for com-
puting its feature statistics in a feature-vector as a
representation of said document; and
means for computing, for each unprocessed document of
said multitude of documents not belonging to said
subset, similarities with said outer-clusters, and for
assigning said document to the leaf-node of said
taxonomy-hierarchy being the outer-cluster with larg-
est similarity.

18. The system of claim 17, wherein said means for
generating further comprises a clustering-tool using a hier-
archical clustering algorithm for generating said taxonomy-
hierarchy, and using said feature-vectors for determining
similarity.

19. The system of claim 17, wherein said features are
extracted based on lexical affinities within said documents.

20. The system of claim 19, wherein said lexical affinities
are extracted with a window of M words to identify
co-occurring words.

21. The system of claim 20, wherein M is a natural
number with 1<M≤5.

22. The system of claim 17, wherein said features are
extracted based on linguistic features within said documents.

23. The system of claim 17, wherein extracted features are
selectively ignored based on statistical frequency extremes.

24. The system of claim 17, wherein the depth of the
taxonomy-hierarchy is limited to L levels by using a slicing
technique to merge most similar clusters into one cluster
until said taxonomy-hierarchy includes L levels.

25. The system of claim 24, wherein L is a natural number
from the range 1≤L≤12.

26. The system of claim 17, wherein said taxonomy-
generation-tool further comprises a labeling-tool labeling
each node in the taxonomy-hierarchy.

27. The system of claim 26, wherein the N most frequent
distinguishing features of a cluster of a node in the
taxonomy-hierarchy are used as labels.

28. The system of claim 27, wherein N is a natural number
with 1≤N≤10.

29. The system of claim 17, wherein said subset of said
multitude of documents is determined by random selection.

30. The system of claim 29, wherein the range of the
document dates is divided into equally sized sub-ranges and
said random selection is performed separately for documents
with document dates from said sub-ranges.

31. The system of claim 17, wherein said subset com-
prises up to 10% of said multitude of said documents.

* * * * *
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